Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation
Abstract
We present an investigation of recently proposed character and word sequence kernels for the task of authorship attribution based on relatively short texts. Performance is compared with two corresponding probabilistic approaches based on Markov chains. Several configurations of the sequence kernels are studied on a relatively large dataset (50 authors), where each author covered several topics. Utilising Moffat smoothing, the two probabilistic approaches obtain similar performance, which in turn is comparable to that of character sequence kernels and is better than that of word sequence kernels. The results further suggest that when using a realistic setup that takes into account the case of texts which are not written by any hypothesised authors, the amount of training material has more influence on discrimination performance than the amount of test material. Moreover, we show that the recently proposed author unmasking approach is less useful when dealing with short texts.
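To make the probabilistic baseline concrete, the following is a minimal sketch of character-level Markov chain attribution. It is an illustration under stated assumptions, not the paper's implementation: the function names (`train_char_markov`, `log_likelihood`, `attribute`), the chain order of 2, and the `vocab_size` value are all hypothetical, and plain add-one smoothing is substituted for the Moffat smoothing mentioned in the abstract.

```python
import math
from collections import Counter, defaultdict


def train_char_markov(text, order=2):
    """Count, for each context of `order` characters, which character follows it."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts


def log_likelihood(text, counts, vocab_size, order=2):
    """Average per-character log-probability of `text` under one author's model.

    Uses add-one smoothing (an assumption; NOT the Moffat smoothing of the paper).
    Returns -inf when the text is too short to contain any transition.
    """
    total = 0.0
    n = 0
    for i in range(order, len(text)):
        ctx = counts.get(text[i - order:i], Counter())
        numerator = ctx[text[i]] + 1
        denominator = sum(ctx.values()) + vocab_size
        total += math.log(numerator / denominator)
        n += 1
    return total / n if n else float("-inf")


def attribute(text, models, vocab_size, order=2):
    """Return the best-scoring author and the per-author scores."""
    scores = {
        author: log_likelihood(text, model, vocab_size, order)
        for author, model in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy usage with two made-up "authors".
models = {
    "author_a": train_char_markov("abababababab"),
    "author_b": train_char_markov("xyzxyzxyzxyz"),
}
best, scores = attribute("ababab", models, vocab_size=256)
print(best)
```

In the setting the abstract calls realistic, where a text may come from none of the hypothesised authors, one possible extension of this sketch is to reject the attribution when the best score falls below a threshold tuned on held-out data.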
Related Resources
Authorship Attribution in Modern Hebrew
This thesis deals with a text classification problem: the identification of the author of a text by its style. Given a text whose author is unknown, and a set of candidates with sample texts, we need to find the true author of the text. The authorship attribution problem has usages in the humanities, in forensic linguistics and in intelligence. The corpora on which this study was done are writt...
Authorship Attribution of Micro-Messages
Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new...
Cross-Genre Authorship Verification Using Unmasking
The effect of author set size and data size in authorship attribution
Applications of authorship attribution ‘in the wild’ [Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advance Access published January 12, 2010. doi:10.1007/s10579-009-9111-2], for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the r...